Investigating Bimodal Clustering in Human Mobility 
Abstract: We apply a simple clustering algorithm to a large
dataset of cellular telecommunication records, reducing the 
complexity of mobile phone users' full trajectories and allowing 
for simple statistics to characterize their properties. For the 
case of two clusters, we quantify how clustered human mobility 
is, how much of a user's spatial dispersion is due to motion 
between clusters, and how spatially and temporally separated 
clusters are from one another. 
I. Introduction

Recently, much effort has been devoted to understanding,
mapping, and modeling large-scale human and animal
trajectories [1]-[12]. Examples include models that describe
agents searching a space for a target of unknown position
[5], [6], [8]; mobility tracking in cellular environments [10],
[13], [14]; human infectious disease dynamics and mobile
phone viruses [9], [12], [15]-[17]; and traffic, transportation,
and urban planning ([2] and references therein). Understanding
spatiotemporal patterns and characterizing human mobility
and social interactions is now possible thanks to the extensive
and widespread use of wireless communication devices [10].

There is growing experimental evidence that the movement
of many species, including humans, can be described
by a class of non-trivial random walk models known as
Lévy flights [1], [2], [18]. A Lévy flight can be considered
a generalization of Brownian motion [18] and belongs to
the class of scale-invariant, fractal random processes. Lévy
flights are Markovian stochastic processes in which step
lengths are drawn from a power-law distribution $\lambda$:

$$\lambda(x) \sim \frac{1}{|x|^{\alpha+1}}, \qquad (1)$$

where $0 < \alpha < 2$ is the Lévy exponent. This implies that
the second moment of $\lambda$ diverges and extremely long jumps
are possible. For a review, see [19].
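
As a minimal sketch of Eq. (1), step lengths with a power-law tail can be drawn by inverse-transform sampling from a Pareto distribution. The lower cutoff `x_min`, the function names, and the uniformly random step direction in two dimensions are illustrative assumptions and are not taken from the models cited above.

```python
import math
import random

def levy_step_length(alpha, x_min=1.0, rng=random):
    """Draw one step length with density ~ 1/x^(alpha+1) for x >= x_min.

    Uses inverse-transform sampling of a Pareto distribution.
    x_min is an assumed lower cutoff, needed to normalize the power law.
    """
    if not 0 < alpha < 2:
        raise ValueError("Levy exponent must satisfy 0 < alpha < 2")
    u = 1.0 - rng.random()              # uniform on (0, 1]
    return x_min * u ** (-1.0 / alpha)

def levy_flight(n_steps, alpha, x_min=1.0, seed=None):
    """Return a 2D trajectory of n_steps + 1 positions starting at the origin."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    path = [(x, y)]
    for _ in range(n_steps):
        length = levy_step_length(alpha, x_min, rng)
        angle = rng.uniform(0.0, 2.0 * math.pi)   # isotropic direction (assumption)
        x += length * math.cos(angle)
        y += length * math.sin(angle)
        path.append((x, y))
    return path
```

Because the second moment diverges for $0 < \alpha < 2$, a handful of very long steps typically dominates the spatial extent of any finite sample produced this way.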

Lévy statistics, somewhat controversially, have been found
in the search behavior of many species including human
hunter-gatherers [20]. Lévy flight-like movement patterns
have been observed while tracking dollar bills [1], mobile
phone users [2], and GPS trajectory traces obtained from
taxicabs and volunteers in various outdoor settings [7], [21].

The mobile phone users studied in [2] reveal behavior
similar to Lévy patterns, but individual trajectories show
a high degree of temporal and spatial regularity. That
investigation analyzed the trajectories of $10^5$ anonymized
phone users, randomly selected from more than six million
subscribers. The unique mobility patterns found in [2] show
a time-independent characteristic travel distance and a high
probability of frequently returning to a few locations.
Additionally, real users' mobility patterns may be approximated
by Lévy flights, but only up to a distance characterized
by the radius of gyration $r_g$. This quantity represents the
characteristic distance travelled by a user $a$ observed up to
time $t$, defined as

$$r_g^a(t) = \sqrt{\frac{1}{n^a(t)} \sum_{i=1}^{n^a(t)} \left(\mathbf{r}_i^a - \mathbf{r}_{CM}^a\right)^2}, \qquad (2)$$

where $\mathbf{r}_i^a$ represents the $i = 1, \ldots, n^a(t)$ positions recorded
for user $a$ and $\mathbf{r}_{CM}^a = 1/n^a(t) \sum_{i=1}^{n^a(t)} \mathbf{r}_i^a$ is the center of
mass of the trajectory [2]. Interestingly, it has been observed
that the radius of gyration increases only logarithmically in
time, which cannot be explained by traditional Lévy models;
therefore, we must return to the data for further analysis.
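
The radius of gyration and center of mass described above can be computed directly from a list of recorded positions. The sketch below assumes planar `(x, y)` coordinates (for example, projected tower locations in km) with one entry per recorded call; the function names are illustrative.

```python
import math

def center_of_mass(points):
    """Mean position of a list of (x, y) tuples."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def radius_of_gyration(points):
    """Root-mean-square distance of the points from their center of mass."""
    cx, cy = center_of_mass(points)
    mean_sq = sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / len(points)
    return math.sqrt(mean_sq)
```

Note that the result does not depend on the order of the points, which is the sense in which $r_g$ is a static measure of the trajectory.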

A. Motivation and Open Questions 

The work in [2] showed that human mobility patterns are
well characterized by the radius of gyration $r_g$. This is a
static measure, in the sense that the order of movement
between locations is irrelevant, and it can thus characterize
time-invariant properties of the trajectory. Mobile phone
data only sample the actual underlying trajectory, however,
with a user-driven, heterogeneous sampling rate [2], and
this complicates the study of the user's real mobility. In
the same way that $r_g$ avoids these sampling problems
by "integrating" over time, an appropriate spatial
coarse-graining can provide a basic picture of the time-dependent,
evolving characteristics of a subject's mobility pattern.

In this paper, we apply a simple clustering algorithm to 
the spatial locations of a user's trajectory. Finding clusters 
of frequently visited locations (such as home and work) and 
collapsing them to a single entity reduces the complexity 
of the full trajectory while allowing for simple statistics 
to capture properties relating to how users move between 
locations. Interesting questions include: 



1) How spatially separated are such clusters? 

2) How often are clusters (re-)visited? How long do users 
dwell within clusters? 

3) Are larger clusters (more recorded calls over time)
more spatially dispersed (as quantified by the $r_g$ of
the cluster's elements) than smaller clusters? How do
the $r_g$'s of clusters relate to the total $r_g$?

The remainder of this paper is organized as follows. We
first briefly discuss the dataset (Sec. I-B) and the clustering
algorithm used to analyze it (Sec. I-C). We then introduce
several important statistics, calculate them from the dataset
at hand, and discuss their implications (Sec. II). A summary
and discussion of future work follows (Sec. III).

B. Data Set 

As in [2], [17] we analyze data from a European mobile 
phone carrier. The data contains the date and time of phone 
calls and text messages from 6 million anonymous users as 
well as the spatial location of the phone towers routing these 
communications. User locations within a tower's service area 
are not known. From this full dataset, we select a random 
subset of 60 000 users that make or receive at least one phone 
call during June - August 2007. The call history of each 
user was then used to reconstruct their trajectory of motion 
during that time period. 
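
A minimal sketch of the reconstruction step is given below. The record layout `(user_id, timestamp, tower_id)` and the `tower_locations` lookup are hypothetical, since the actual data format is not described here; the only idea carried over from the text is that each user's calls are ordered in time and mapped to the position of the routing tower.

```python
from collections import defaultdict

def build_trajectories(call_records, tower_locations):
    """Group call records by user and order them in time.

    call_records: iterable of (user_id, timestamp, tower_id) tuples
        (hypothetical layout, assumed for illustration).
    tower_locations: dict mapping tower_id -> (x, y) position (hypothetical).
    Returns {user_id: [(timestamp, (x, y)), ...]} sorted by timestamp.
    """
    trajectories = defaultdict(list)
    for user_id, timestamp, tower_id in call_records:
        trajectories[user_id].append((timestamp, tower_locations[tower_id]))
    for visits in trajectories.values():
        visits.sort(key=lambda visit: visit[0])   # chronological order
    return dict(trajectories)
```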

C. Clustering 

Our analysis is based on $k$-means clustering which, in
general, divides $N$-dimensional populations (or observations)
into $k$ distinct sets or clusters by minimizing the
intra-cluster sum of squares $w^2$ of an appropriately defined
distance metric. Using the notation in [22] (see also [23]):

$$w^2(S) = \sum_{i=1}^{k} \int_{S_i} |z - \mu_i|^2 \, dp(z), \qquad (3)$$

where $p$ is the probability mass function for the observations,
$S = \{S_1, S_2, \ldots, S_k\}$ is the partition into $k$ sets, and
$\mu_i$ ($i = 1, \ldots, k$) is the
conditional mean of $p$ over the set $S_i$. For this work, $\mu_i$ will
be the center of mass of cluster $i$ and we seek to find the
locations of $\mu_i$ that minimize the squared Euclidean distance
between $\mu_i$ and that cluster's recorded call locations.
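
The text does not say which $k$-means implementation was used. As a sketch under that caveat, the standard Lloyd iteration for two clusters alternates between assigning each call location to the nearer center and moving each center to the mean of its members. The initialization used here (the first point and the point farthest from it) is an illustrative assumption.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two (x, y) tuples."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def mean_point(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def two_means(points, max_iter=100):
    """Partition (x, y) tuples into k = 2 clusters with Lloyd's iterations.

    Returns (labels, centers), where labels[i] is 0 or 1.
    The initial centers are an illustrative choice, not taken from the paper.
    """
    if len(set(points)) < 2:
        raise ValueError("need at least two distinct locations")
    first = points[0]
    second = max(points, key=lambda p: squared_distance(p, first))
    centers = [first, second]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the nearer center.
        new_labels = [
            0 if squared_distance(p, centers[0]) <= squared_distance(p, centers[1]) else 1
            for p in points
        ]
        if new_labels == labels:
            break                                # assignments are stable
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                          # keep old center if a cluster empties
                centers[c] = mean_point(members)
    return labels, centers
```

Lloyd's iteration only finds a local minimum of the intra-cluster sum of squares, so in practice several initializations are usually compared.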

For simplicity, we assume $k = 2$ clusters throughout
this work. This is a serious assumption, and identifying
the correct number of clusters on a per-user basis remains
important future work. However, we have found that the
majority of users are well clustered with $k = 2$. To show
this, we compute the mean silhouette value [24] for each
user. Define $A(\mathbf{r}_i)$ as the average squared (Euclidean) distance
from $\mathbf{r}_i$ to all other points in the same cluster, and define
$B(\mathbf{r}_i)$ as the average squared distance from $\mathbf{r}_i$ to all points
in the other cluster. The silhouette value $s(\mathbf{r}_i)$ for point
$\mathbf{r}_i$ is

$$s(\mathbf{r}_i) = \frac{B(\mathbf{r}_i) - A(\mathbf{r}_i)}{\max\{A(\mathbf{r}_i), B(\mathbf{r}_i)\}}, \qquad (4)$$

and takes values between -1 and 1, with larger values
indicating that $\mathbf{r}_i$ is increasingly well separated from the other
cluster. Taking the average $\langle s \rangle = \langle s(\mathbf{r}_i) \rangle_i$ over all points
then provides a single statistic measuring how well the whole
data set is clustered. Poor choices of $k$, for example, lead to
smaller $\langle s \rangle$ [24]. We find that 91.8% of users have $\langle s \rangle > 0.8$
and 80.8% have $\langle s \rangle > 0.9$, indicating that the majority of
users are well clustered with just $k = 2$.
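
The mean silhouette value for a two-cluster labeling can be sketched as follows, using average squared Euclidean distances for $A$ and $B$ as defined above. Assigning a silhouette of 0 to a point with no other members in its cluster, or when the other cluster is empty, is a common convention assumed here; the text does not say how such cases were handled.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two (x, y) tuples."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def mean_silhouette(points, labels):
    """Mean silhouette <s> for a two-cluster labeling (labels are 0 or 1)."""
    values = []
    for i, p in enumerate(points):
        same, other = [], []
        for j, q in enumerate(points):
            if j == i:
                continue
            if labels[j] == labels[i]:
                same.append(squared_distance(p, q))    # contributes to A
            else:
                other.append(squared_distance(p, q))   # contributes to B
        if not same or not other:
            values.append(0.0)                         # assumed convention
            continue
        a = sum(same) / len(same)
        b = sum(other) / len(other)
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return sum(values) / len(values)
```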

II. Results 

For each user, we apply the $k$-means algorithm to their
trajectory, partitioning the call locations into $k = 2$ sets. The
numbers of calls in clusters 1 and 2 for user $a$ are $N_1(a)$ and
$N_2(a)$, respectively (we identify $N_1(a) > N_2(a)$ such that
cluster 1 is the primary cluster) and $N_T(a) = N_1(a) + N_2(a)$
is the total number of calls that user $a$ makes during the
sample period. The distribution $P(N)$ of the number of calls
is shown in Fig. 1.

Using the spatial distribution of calls we compute each
cluster's center of mass, $\mathbf{r}_{CM}^{(1)}$ and $\mathbf{r}_{CM}^{(2)}$, and radius of
gyration, $r_g^{(1)}$ and $r_g^{(2)}$, as well as the total center of mass and
radius of gyration for all points, $\mathbf{r}_{CM}^{(T)}$ and $r_g^{(T)}$, respectively.
The distributions of these quantities are shown in Fig. 2.

To quantify relationships between the two clusters, we
compute the separation between their centers of mass, $d_{CM}$,

$$d_{CM} = \left\| \mathbf{r}_{CM}^{(1)} - \mathbf{r}_{CM}^{(2)} \right\|, \qquad (5)$$

where $\|\ldots\|$ is the Euclidean norm, and we also count how
often a user "jumps" between the two clusters. A user who
makes $N_T$ calls will have $N_T - 1$ jumps between locations
(including remaining at the current location). We define $F_{cc}$
as the fraction of cross-cluster jumps, those that begin and
end in different clusters. The distributions of $F_{cc}$ and $d_{CM}$
are shown in the insets of Figs. 1 and 2, respectively.
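
Both statistics are simple to compute once each call carries a cluster label. The sketch below assumes planar `(x, y)` centers of mass and a time-ordered list of labels with one label per call; the function names are illustrative.

```python
import math

def center_separation(center_1, center_2):
    """Euclidean distance between the two clusters' centers of mass."""
    return math.hypot(center_1[0] - center_2[0], center_1[1] - center_2[1])

def cross_cluster_fraction(labels):
    """Fraction of consecutive jumps that begin and end in different clusters.

    labels: cluster label of each call, in chronological order.
    A user with N calls has N - 1 jumps, including jumps that stay put.
    """
    if len(labels) < 2:
        raise ValueError("need at least two calls to define a jump")
    crossings = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    return crossings / (len(labels) - 1)
```

For example, the label sequence `[0, 0, 1, 0, 0]` contains four jumps, two of which cross between clusters, giving a cross-cluster fraction of 0.5.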

The number of calls per cluster indicates that users spend
the majority of their time in one cluster and visit the other
cluster more rarely. The fraction of cross-cluster jumps is
small: $\langle F_{cc} \rangle_{users}$ is less than 0.1 (where $\langle \ldots \rangle_{users}$ is an
average over all sampled users), indicating that the primary
cluster provides a stable location in which the user dwells.
Likewise, $d_{CM}$ is relatively large, $\langle d_{CM} \rangle_{users} = 157.8$ km,
indicating that we are finding a semi-frequent but
long-distance destination. It would be interesting to examine temporal
dependencies of the cluster occupation probability: are
users more likely to be in the secondary cluster on weekends,
for example?

Figure 1: Temporal properties of clusters. Shown is the
distribution $P(N)$ of the number of phone calls inside
each cluster and the total number of calls, $N_1$, $N_2$, and
$N_T = N_1 + N_2$, respectively. The majority of calls take
place in cluster 1, but a non-negligible number occur in
cluster 2. (inset) The fraction of jumps $F_{cc}$ from one cluster
to another, quantifying how often users travel between their
clusters. The primary cluster tends to contain the majority
of calls and users tend to move between clusters somewhat
rarely, $\langle F_{cc} \rangle_{users} = 0.098$, though some move much more
frequently.

Figure 2: Spread and separation of trajectory clusters.
Shown is the distribution $P(r_g)$ of the radii of gyration
for both clusters and over all points, $r_g^{(1)}$, $r_g^{(2)}$, and $r_g^{(T)}$,
respectively. The secondary cluster tends to be slightly
more spatially dispersed than the primary cluster. Lines
are truncated power laws of the form $(r_g + r^0)^{-\beta_r} e^{-r_g/\kappa}$,
characterized by parameters $\theta = (r^0, \beta_r, \kappa)$. For the above
curves, $\theta_1 = (5.5, 1.5, 70)$, $\theta_2 = (0.75, 0.9, 70)$, and
$\theta_T = (15, 1.4, 260)$, for cluster 1, cluster 2, and both,
respectively. (inset) The distribution $P(d_{CM})$ of distances
between clusters' centers of mass, over all users. The straight
line is an exponential distribution with mean $\lambda^{-1} = 157.8$
km, indicating that clusters are often well separated, but
distances fall off rather quickly.

The cluster's radius of gyration summarizes how compact
or dispersed user movement is within that cluster. We find
that the larger cluster (in terms of the number of calls) tends
to be slightly more spatially compact than the smaller cluster.
The radii of gyration of both clusters are much smaller than $r_g^{(T)}$,
which is to be expected when $d_{CM}$ is large. This means that
much of the user's total radius of gyration is generated by
movement between two well-separated clusters, as opposed
to homogeneous motion over a large space.
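
The claim that much of the total radius of gyration is generated by movement between clusters can be made quantitative with a standard decomposition of the mean squared displacement about the center of mass (the parallel axis theorem). It is not derived in the text and is offered here only as a sketch, assuming every call is assigned to exactly one of the two clusters so that $N_T = N_1 + N_2$:

```latex
\left(r_g^{(T)}\right)^2
  = \frac{N_1}{N_T}\left(r_g^{(1)}\right)^2
  + \frac{N_2}{N_T}\left(r_g^{(2)}\right)^2
  + \frac{N_1 N_2}{N_T^2}\, d_{CM}^2 .
```

The first two terms are the intra-cluster contributions and the last term is the inter-cluster contribution. When $d_{CM}$ is much larger than either cluster's radius of gyration, the last term dominates unless the fraction of calls in the secondary cluster, $N_2/N_T$, is very small.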

III. Conclusions and Future Work 

We have applied a simple $k$-means clustering algorithm
to a large sample of human trajectories generated from
mobile phone records. Doing this characterizes how users
move within their set of visited locations, and we find
that people tend to have one dense, primary cluster and
one secondary, dispersed cluster. Coarse-graining a user's
trajectory into clusters also quantifies how often users move
between clusters, and we find that users spend the majority
of their time in the primary cluster but visit the secondary
cluster semi-frequently. The clusters themselves tend to be
well separated, indicating that the secondary cluster is a
long-range destination, but the distribution of these distances
over all users falls off exponentially quickly, compared to
the total radius of gyration.

The most important avenue for future work involves relaxing
the assumption of $k = 2$ clusters. While mean silhouette
values have shown that the data are well characterized by
two clusters, it remains to be seen if introducing more
clusters improves the picture. Furthermore, since so much
of a typical user's time is spent in the primary cluster,
there remains the tantalizing possibility that further
substructure is present within it. In other words, the secondary
cluster may represent infrequent long-range trips while the
primary cluster may represent the union of home and work
clusters, or home and school. Information about important
routines such as daily commuting may be contained within
the primary cluster.